Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model

Background Viral infections have been the main health issue in the last decade. Antiviral peptides (AVPs) are a subclass of antimicrobial peptides (AMPs) with substantial potential to protect the human body against various viral diseases. Although there has been significant production of antiviral vaccines and medications, the recent development of AVPs as antiviral agents suggests an effective way to treat virus-affected cells. Moreover, the use of intelligent machine learning techniques for developing peptide-based therapeutic agents is attracting increasing interest due to its significant outcomes. The existing wet-laboratory-based approaches are expensive, time-consuming, and cannot effectively screen and predict the targeted motifs of antiviral peptides. Methods In this paper, we propose a novel computational model called Deepstacked-AVPs to discriminate AVPs accurately. The training sequences are numerically encoded using a novel Tri-segmentation-based position-specific scoring matrix (PSSM-TS) and word2vec-based semantic features. Composition/Transition/Distribution-Transition (CTDT) is also employed to represent the physicochemical properties based on structural features. In addition, a fused vector is formed from the PSSM-TS features, semantic information, and CTDT descriptors to compensate for the limitations of single encoding methods. Information gain (IG) is applied to choose the optimal feature set. The selected features are used to train a stacked-ensemble classifier. Results The proposed Deepstacked-AVPs model achieved a predictive accuracy of 96.60%, an area under the curve (AUC) of 0.98, and a precision-recall (PR) value of 0.97 using training samples. In the case of the independent samples, our model obtained an accuracy of 95.15%, an AUC of 0.97, and a PR value of 0.97.
Conclusion Our Deepstacked-AVPs model outperformed existing models, with ~4% and ~2% higher accuracy on the training and independent samples, respectively. The reliability and efficacy of the proposed Deepstacked-AVPs model make it a valuable tool for scientists, and it may play a beneficial role in pharmaceutical design and academic research.


Introduction
Viruses are serious and ubiquitous pathogens that cause high rates of infection and mortality in humans and animals [1]. Viral infections can affect species for a long time because of their varied modes of transmission, genetic variations, and effective survival in host cells [2]. Recently, the prevalence of zoonotic viruses such as Zika, Ebola, and the novel SARS-CoV-2 has caused chronic and deadly diseases [3]. Presently, hundreds of different antiviral medications have been developed for treating various families of viruses, e.g., HIV, rhinoviruses, herpes, hepatitis B-C, and influenza [4]. The prevention of viral diseases is challenging owing to inadequate antiviral therapies and a lack of state-of-the-art viral pathogens. Traditional medications suffer from inefficiency, high side effects, and time-consuming procedures [5]. Antiviral peptides (AVPs) are considered one of the key classes of antimicrobial peptides used in developing novel, powerful peptide-based therapeutics for treating different viral infections. AVPs are small peptides that can be synthetically obtained using twenty amino acids or chemical clusters into natural peptide samples [6]. AVPs have numerous favorable characteristics, i.e., low side effects, high efficiency, low molecular weight, and low toxicity. They can be widely applied in producing innovative antiviral therapeutics [7].
With the huge growth in genomics data in recent decades, data-driven methods based on computational intelligence have attracted great attention and are considered an alternative for predicting various therapeutic functions in bioinformatics. Consequently, different machine-learning models have been developed for predicting antiviral peptides (AVPs). Initially, Thakur et al. developed the AVPpred model by applying amino acid composition, sequence alignment, physicochemical properties, and motif search for feature formulation [8]. The extracted feature spaces were trained via the SVM model using a tenfold cross-validation test. AVPpred was trained and validated using two different datasets. Similarly, Chang et al. employed the random forest model by incorporating other sequence encoding methods such as compositional, aggregation, secondary structure, and physicochemical properties [9]. Later on, AVP-IC50Pred applied four different machine learning classifiers using the binary profile, residue composition, and structural features for predicting activity related to AVPs [10]. Furthermore, Nath et al. applied a stacked-ensemble classifier using alignment scoring and an evolutionary descriptors-based feature encoding approach [11]. Lissabet et al. developed the AntiVPP 1.0 predictor for AVPs [12]. Residue composition and relative frequency-based encoding techniques were applied to obtain features from peptide samples. The obtained vector was trained and validated using a random forest (RF) model. Similarly, the PEPred-Suite model utilized adaptive formulation methods for peptide samples for predicting eight different functional types of therapeutic peptides [13]. A two-level feature selection was employed using the ensemble RF model. Similarly, HybAVPnet presented a two-step training approach for predicting AVPs [14]. Eighteen different encoding techniques were evaluated using light-GBM and neural network models. In the training phase of the HybAVPnet model, the predicted probabilities of the step-1 classifiers were provided to the SVM model for evaluating the resultant outcomes. Akbar et al. proposed an ensemble classifier using a transformed-evolutionary and SHAP feature selection-based model for predicting AVPs [15]. Meta-iAVP presented a stacking approach using the predicted scores of SVM, KNN, GLM, RF, regression trees, and XGBoost models [16]. Different frequency and amphiphilic-pseudo amino acid compositions were applied for the numerical representation of peptide samples. Pang et al. developed the AVPIden model for predicting peptide samples with antiviral activities from six different virus families with eight types [17]. Additionally, AVPIden used physicochemical properties, frequency, and gapped-compositional features for peptide representation. Recently, Lin et al. developed AI4AVP for AVPs by training deep convolutional neural networks using a variety of formulation methods [18].
After carefully examining all the above-mentioned studies, we found that each model plays a significant and active role in predicting AVPs. However, these methods still suffer from reliability and generalization problems. Most existing models applied sequential encoding schemes that only target the residue composition of the individual amino acids without preserving the sequence order information. Some models proposed traditional evolutionary feature descriptors, which are very time-consuming to calculate for each protein sample by searching databases. Additionally, from a training point of view, the existing models mainly relied on traditional machine learning (ML) models. In contrast, models trained via ensemble learning have recently outperformed traditional ML models in bioinformatics. Therefore, we chose a stacked ensemble classifier to effectively train the model using diverse feature representations. The main advantages of stacking over other ensemble classifiers, such as boosting and bagging, include its ability to capture diverse patterns in the input data and leverage the strengths of the baseline classifiers. In the case of small training datasets, stacked ensemble models have shown better performance than boosting and bagging. Moreover, if the relationships between features are complex and cannot easily be captured by individual learners, then stacking has an edge over other ensemble classifiers.
In this paper, we propose a stacked-ensemble model, Deepstacked-AVPs, for predicting AVPs. The peptide sequences were numerically formulated using Composition/Transition/Distribution with Transition (CTDT) based physicochemical properties and a word2vec-based skip-gram model for capturing semantic information using different k-mers. Apart from these, we developed a novel Position-Specific Scoring Matrix tri-segmentation to form an improved evolutionary matrix named PSSM-TS. A multi-feature vector is generated using the CTDT, k-mer, and PSSM-TS vectors. Information gain (IG) based feature selection is then employed to gather optimal features for effective prediction of the targeted class. The training process of the proposed model consists of two phases. Initially, four base models, i.e., random forest (RF), extra tree classifier (ETC), XGBoost (XGB), and deep neural network (DNN), are trained individually. Then, a stacked ensemble-based meta-classifier is formed through logistic regression (LR) using the predicted scores of the individual classifiers [19]. The proposed Deepstacked-AVPs exhibited superior performance on both the training and independent datasets. The complete framework of the proposed Deepstacked-AVPs model is depicted in Fig. 1.

Dataset description
In bioinformatics, selecting an appropriate training dataset is crucial in developing an automatic intelligent model [20][21][22]. The choice of benchmark dataset significantly impacts a computational model's performance. The training dataset used in this study was initially constructed for the AVPpred predictor [8]. While preparing the dataset, unnecessary letters such as 'B', 'U', and 'X' were removed from the peptide sequences. The training dataset consists of 951 samples, of which 544 are AVPs and 407 are non-AVPs. Moreover, a similar training dataset has been applied for developing various models, such as Chang et al. [9] and AntiVPP 1.0 [12]. Additionally, an independent dataset was used to examine the reliability and generalization of our trained model. The independent dataset comprised unseen samples with 60 AVP and 60 non-AVP sequences [23]. The independent samples were selected so that there is no overlap between the training and independent datasets, allowing overfitting of our model to be assessed.

Position-specific scoring matrix using tri-segmentation (PSSM-TS)
Position-Specific Scoring Matrix (PSSM) can represent the evolutionary profile of amino acid sequences. However, a simple PSSM vector cannot capture the sequence ordering information of the local residues [24]. Moreover, recent computational models have observed that local residues of the PSSM descriptor provide highly discriminative and reliable features that lead to high predictive outcomes for different biological problems [25][26][27][28]. Hence, we introduced a tri-segmentation (TS) concept into the PSSM vector [29]. The PSSM information is divided by row into three segments of equal size. Then, each segment is individually transformed into a 20-dimensional vector $F^{\psi}$, where $\psi$ signifies the slice number ($\psi = 1, 2, 3$) and $F^{\psi}$ denotes the residue type of the twenty natural amino acid residues in Seg-PSSM. The Tri-segmentation PSSM (TS-PSSM) can be represented as the concatenation of the three segment vectors:
$$\text{TS-PSSM} = \left[ F^{1}, F^{2}, F^{3} \right]$$
The dimension of the proposed TS-PSSM vector is 60D.
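The row-wise tri-segmentation can be illustrated with a minimal Python sketch. The text does not preserve how each segment is reduced to twenty values, so the column-wise mean used here is an assumption, as are the function name `pssm_tri_segment` and the toy matrix; only the split into three row segments and the 60D output come from the description above.

```python
def pssm_tri_segment(pssm, n_segments=3):
    """Split an L x 20 PSSM by row into n_segments near-equal parts and
    reduce each part to one value per column.

    ASSUMPTION: the per-segment reduction is a column-wise mean.
    """
    length = len(pssm)
    if length < n_segments:
        raise ValueError("sequence is shorter than the number of segments")
    n_cols = len(pssm[0])
    features = []
    for seg in range(n_segments):
        # Integer boundaries that spread leftover rows as evenly as possible.
        start = seg * length // n_segments
        end = (seg + 1) * length // n_segments
        rows = pssm[start:end]
        for col in range(n_cols):
            features.append(sum(row[col] for row in rows) / len(rows))
    return features


# Hypothetical 6 x 20 matrix in which row i is filled with the value i.
toy_pssm = [[float(i)] * 20 for i in range(6)]
ts_pssm = pssm_tri_segment(toy_pssm)
print(len(ts_pssm))  # 3 segments x 20 columns
```

With three segments and twenty columns the output always has 60 entries, regardless of peptide length, which is what allows peptides of different lengths to share one feature space.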

Word2Vec-based word embedding
In the word2vec approach, the contextual relationships among words are captured, yielding distributed representations that encode various linguistic regularities and patterns [30]. Within the protein-encoding process, segments of k amino acids, commonly referred to as k-mers, are treated as individual lexical units. Each peptide sequence was segmented into k-mers using the window method, which has been widely employed in natural language processing [31,32]. In this work, we used a skip-gram model for word representations using different k-mers, which are used for the prediction of other words within the peptide sentence. In a given corpus, the skip-gram model is used for training the word vector of each word. For a word S(a) within a sentence, skip-gram predicts the probabilities P(S(a + i)|S(a)) of the neighboring words S(i) (a − k ≤ i ≤ a + k) depending on the current word S(a), as shown in Fig. 2. Every word space represents the residue of the neighboring words. The key objective of the skip-gram is to maximize the value of E as follows:
$$E = \frac{1}{n} \sum_{a=1}^{n} \sum_{\substack{-k \le i \le k \\ i \ne 0}} \log P\big(S(a+i) \mid S(a)\big) \qquad (1)$$
where the parameter k represents the window size, S(a + i) (−k ≤ i ≤ k) signifies the k words neighboring the current word S(a), and the term n denotes the total number of words.
Because word2vec can capture the residue relationships of the words within the amino acid sample and preserve structural information, we considered the k-mers as "words". Finally, from each sample, a word embedding vector of 100D is extracted using the skip-gram model.
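To make the "k-mers as words" idea concrete, the following Python sketch splits a peptide into overlapping k-mers and lists the (current word, neighboring word) pairs that a skip-gram model would be trained on. The stride of one residue, the helper names `to_kmers` and `skipgram_pairs`, and the example peptide are illustrative assumptions; the embedding training itself is not shown.

```python
def to_kmers(sequence, k=3):
    """Slide a window of width k over the sequence (ASSUMED stride of 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(words, window=2):
    """Return (current word, neighboring word) pairs within +/- window."""
    pairs = []
    for a, center in enumerate(words):
        lo = max(0, a - window)
        hi = min(len(words), a + window + 1)
        for i in range(lo, hi):
            if i != a:  # a word is never its own context
                pairs.append((center, words[i]))
    return pairs


peptide = "ACDEFG"  # hypothetical peptide, not taken from the dataset
words = to_kmers(peptide, k=3)
pairs = skipgram_pairs(words, window=1)
print(words)
print(pairs)
```

Each pair corresponds to one term log P(S(a + i)|S(a)) in the skip-gram objective, so the number of pairs grows with both the peptide length and the window size.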

Composition/transition/distribution with transition (CTDT)
CTDT is a physicochemical properties-based global distribution method for peptide sequences [33]. CTDT represents the structural and biochemical characteristics of protein sequences based on different types of groups. The last T in CTDT signifies the transition among three groups of amino acid properties: hydrophobic, polar, and neutral. More specifically, the occurrence frequencies of transitions between these groups are computed [34,35]. For two adjacent amino acid residues (r, s), the CTDT features can be calculated as follows:
$$T(r, s) = \frac{N(r, s) + N(s, r)}{N - 1}$$
where the residue pairs $(r, s) \in \{(\text{positive}, \text{neutral}), (\text{neutral}, \text{negative}), (\text{negative}, \text{positive})\}$, N(r, s) and N(s, r) represent the frequency counts of the adjacent pairs "r, s" and "s, r" within the peptide sequence, and N is the sequence length. The analysis was performed using thirteen distinct physicochemical properties. The resultant CTDT feature vector comprises 39 features for each sample.
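As a minimal sketch of the transition descriptor for a single property, the Python code below counts how often adjacent residues switch between two of three groups and normalizes by the number of adjacent pairs (sequence length minus one). The three-way residue grouping in `GROUPS` is an assumed hydrophobicity-style partition used only for illustration, and `ctdt_transitions` is a hypothetical helper name; repeating the computation over thirteen properties would give 13 × 3 = 39 values.

```python
# ASSUMED three-class partition of the twenty amino acids for one property.
GROUPS = {
    1: set("RKEDQN"),
    2: set("GASTPHY"),
    3: set("CLVIMFW"),
}


def ctdt_transitions(sequence, groups=GROUPS):
    """Return the three transition frequencies (1<->2, 1<->3, 2<->3)."""
    if len(sequence) < 2:
        raise ValueError("at least two residues are needed")

    def group_of(residue):
        for label, members in groups.items():
            if residue in members:
                return label
        raise ValueError(f"unexpected residue: {residue!r}")

    labels = [group_of(r) for r in sequence]
    counts = {(1, 2): 0, (1, 3): 0, (2, 3): 0}
    for left, right in zip(labels, labels[1:]):
        if left != right:
            # (r, s) and (s, r) are pooled, so sort the pair before counting.
            counts[tuple(sorted((left, right)))] += 1
    n_pairs = len(sequence) - 1
    return [counts[pair] / n_pairs for pair in ((1, 2), (1, 3), (2, 3))]


print(ctdt_transitions("RGC"))
```

Pooling "r, s" with "s, r" makes the descriptor independent of the direction of the transition, which is why each property contributes three values rather than six.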

Information gain
In the feature engineering phase, redundant features can reduce the predictive accuracy of a training model [36]. Therefore, feature selection techniques are commonly employed to identify and choose the most relevant features from the extracted space, improving the classification rates with minimum computational cost [37,38]. In this work, we used information gain (IG) as the feature selection method to identify the most significant features by examining the reduction in entropy obtained by splitting the training samples based on the value of a random attribute [39]. A high IG value represents low entropy.
For given training samples denoted by S with attributes represented by A, the IG(S, A) associated with the attribute A is defined as the reduction in entropy observed within the training samples when the attribute A is considered [40]. It can be expressed mathematically as follows:
$$IG(S, A) = H(S) - H(S \mid A)$$
where H(S) represents the entropy of the training samples S, and H(S | A) signifies the entropy of the training samples S under the condition that the attribute A has been observed. In the classical scenario of a dichotomous classification:
$$H(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$
and
$$H(S \mid A) = \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$
Here, $p_{+}$ and $p_{-}$ are the proportions of positive and negative samples in S, and the notation Values(A) denotes the collection of all possible values associated with the attribute A. Additionally, $S_v$ stands for the partition of the training dataset that corresponds to the specific value v of the attribute A. The entropy of this partition is calculated by $H(S_v)$. The vertical bars (|.|) represent the cardinality operator [40].
In this work, 89 optimal features were selected using IG-ranking-based feature selection. The features were ranked in descending order of IG, so that the attribute with the largest information gain receives the highest priority.
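The IG ranking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attributes are treated as discrete (continuous features such as PSSM scores would first need to be binned), and the names `entropy`, `information_gain`, `rank_features`, and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v) for a discrete attribute."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    conditional = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - conditional


def rank_features(columns, labels, top_n):
    """Return the names of the top_n attributes in descending order of IG."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:top_n]


# Toy data: one attribute separates the classes perfectly, one not at all.
labels = [1, 1, 0, 0]
columns = {
    "useless": ["a", "b", "a", "b"],
    "perfect": ["a", "a", "b", "b"],
}
print(rank_features(columns, labels, top_n=1))
```

An attribute that splits the samples into pure partitions leaves no residual entropy, so its IG equals H(S); an attribute whose partitions mirror the overall class balance has an IG of zero and falls to the bottom of the ranking.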

UMAP-based features visualization
To demonstrate the effectiveness of the extracted features of our proposed model, we employed a Uniform Manifold Approximation and Projection (UMAP) based statistical visualization technique [41]. UMAP is a data visualization approach that preserves not only local structural information but also global structural relationships. In the visualization, the samples of the two classes are represented as two different clusters, which is useful for comprehending the relationship among samples of the different classes used for discriminating peptides. In this study, we performed the UMAP visualization of the extracted features of the training samples, i.e., CTDT, Word2Vec, PSSM-TS, Hybrid features, and IG-based optimal features, and of the independent samples, as shown in Fig. 5.

Model architecture of deepstacked-AVPs model
The stacking model used in this study mainly comprises two phases [42]. Initially, the baseline models, namely ETC [43], RF [44], XGB [28], and DNN [45], are trained on the vectors extracted from the training dataset. The grid search approach is employed to select the optimal model training parameters. In the case of the DNN model, two dense layers were added after the input layer to facilitate feature matrix extraction. The Rectified Linear Unit (ReLU) activation function is applied in the two dense layers to deal with nonlinearity. The model is trained with the Adam optimizer, and early stopping and dropout strategies are used to reduce the risk of overfitting.
In the second phase, a meta-classifier is formed using logistic regression (LR) on the predicted probability scores of the baseline classifiers. The probability scores lie in the range 0 to 1, and a threshold of 0.5 is used to define the predicted class of a protein sample, i.e., a probability > 0.5 predicts label 1, and a probability < 0.5 predicts label 0. Implementing the LR-based stacked-ensemble classifier significantly increased the predictive rates compared to the individual baseline models.
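The second phase can be sketched as follows: the probability scores of the four baseline classifiers form a four-dimensional input to a logistic regression meta-learner, whose output is thresholded at 0.5. The sketch is deliberately library-free and makes several assumptions: the base-model scores are invented toy numbers, the meta-learner is fitted by plain batch gradient descent with an arbitrary learning rate and epoch count, and `fit_meta_lr` and `predict_label` are hypothetical names.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_meta_lr(scores, labels, lr=0.5, epochs=2000):
    """Fit logistic regression on base-classifier probabilities
    using batch gradient descent on the log loss."""
    n, n_features = len(scores), len(scores[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_label(weights, bias, x, threshold=0.5):
    """Return (label, probability); label is 1 if probability > threshold."""
    prob = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
    return (1 if prob > threshold else 0), prob


# Toy probability scores from four base models (columns: RF, ETC, XGB, DNN).
base_scores = [
    [0.90, 0.80, 0.85, 0.70],
    [0.80, 0.90, 0.60, 0.75],
    [0.20, 0.10, 0.30, 0.25],
    [0.30, 0.20, 0.15, 0.40],
]
true_labels = [1, 1, 0, 0]
weights, bias = fit_meta_lr(base_scores, true_labels)
predicted = [predict_label(weights, bias, x)[0] for x in base_scores]
print(predicted)
```

In practice the meta-learner is usually fitted on out-of-fold predictions of the base models rather than on their training-set outputs, so that it does not simply learn to trust the most overfitted base classifier.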

Experimental configuration
Our proposed model was developed on an Intel(R) Xeon(R) processor @ 3.3 GHz with 64 GB of RAM. For code implementation, we used the Windows 10 operating system and Python 3.10.6 as the programming language. Additionally, several Python libraries were used in the model training process.

Performance evaluation
In bioinformatics and applied machine learning, various performance parameters are utilized to assess the predictive capabilities of the training models [46,47]. In binary classification problems, a confusion matrix is usually formed to store the prediction outcomes of the training models, i.e., True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). The predictive accuracy is often the primary metric for assessing model effectiveness [48,49]. However, to evaluate our model more rigorously, we employed the following performance evaluation metrics:
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Sn = \frac{TP}{TP + FN}$$
$$Sp = \frac{TN}{TN + FP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where Acc, Sn, Sp, and MCC represent the accuracy, sensitivity, specificity, and Matthews correlation coefficient, respectively.

Results and discussion
In this study, we evaluated our model using a fivefold cross-validation (CV) test, where the data in each fold were randomly selected [50]. Additionally, to obtain reliable results despite the random distribution of data among the folds, we repeated the stratified CV procedure 50 times and reported the mean value [50,51]. In the following subsections, we thoroughly discuss the classification outcomes of the individual and ensemble training models before and after applying feature selection.

Prediction analysis of classification models using training samples
The prediction results of the individual feature vectors using training peptides are given in Table 2. As mentioned above, we employed three different extraction methods, CTDT, word2vec, and PSSM-TS, to numerically transform the peptide samples. At first, the extracted vectors were evaluated individually using four machine-learning models: RF, ETC, XGB, and DNN. The optimal parameters used for training the individual machine-learning models are provided in Table

Prediction analysis of classifiers using information gain based feature selection
In bioinformatics-based machine learning models, feature selection plays a key role in building a cost-effective computational model. Selecting highly relevant features from the training vector improves performance [52,53]. As described in Table 2, the hybrid vector achieved higher results than the individual vectors. Hence, we applied information gain (IG) to select highly relevant features from the hybrid vector. The hybrid vector (CTDT + 3-mer + PSSM-TS) contains 199 features: 39 features from CTDT, 100 features from 3-mer, and 60 features from PSSM-TS. After employing IG feature selection, only 50 optimal features were selected from the whole hybrid vector. As before, the selected feature vector was also evaluated using the RF, ETC, XGB, and DNN classifiers. The evaluation results of the chosen vector are provided in Table 3 and show its contribution to accurately discriminating the targeted classes. In the case of individual learning models using selected features, ETC, XGB, and DNN achieved accuracies of 89.44%, 89.16%, and 89.90%, respectively. In comparison, the RF training model performed well by reporting an accuracy of 91.28% and an AUC of 0.96. Furthermore, the Stacked-Ensemble model boosted the results by obtaining an accuracy of 96.60%, with sensitivity, specificity, and MCC values of 94.85%, 97.35%, and 0.92, respectively. The performance of the individual features, hybrid features, and optimal features using training samples is compared in Fig. 4. The higher predictive results show the effectiveness of selecting highly discriminative features via IG feature selection. Moreover, the proposed method is validated using an independent dataset to show its generalization power, as provided in Table 4. The instance-based AUC and precision-recall (PR) analysis of the training and independent features are provided in Fig. 3.

Comparison of deepstacked-AVPs model with existing predictors
To assess the efficacy of our study, we conducted a comparative analysis of our predictor with existing models using training and independent datasets, as shown in Table 5. In the case of training samples, the AVPpred model, using sequential and physicochemical properties-based motif descriptors, achieved an ACC of 85%, with Sn, Sp, and MCC of 82.20%, 88.20%, and 0.70, respectively [8]. Similarly, Chang et al. trained the RF model on the same training samples using compositional residue encoding, aggregation, and secondary structure features [9]. Their model achieved an ACC of 85.10%, Sn of 86.60%, Sp of 83%, and MCC of 0.70. Meta-iAVP used physicochemical properties based on computational features by applying the stacking concept [16]. The predicted probabilities of six machine learning models were provided to the stacking algorithm, which obtained an ACC of 88.20%, with Sn, Sp, and MCC of 89.20%, 86.90%, and 0.76, respectively. Further, the FIRM-AVP predictor obtained the optimal ranking based on numerical features using the structural and physicochemical properties of peptides [54]. FIRM-AVP reported an ACC of 92.40%, Sn of 93.30%, Sp of 91.10%, and MCC of 0.84. In contrast, our proposed Deepstacked-AVPs model outperformed these models, with 4.2%, 6.25%, and 0.08 higher ACC, Sp, and MCC, respectively. Apart from these, using an independent dataset, the AVPpred model obtained 92.50% ACC, with 93.30% Sn, 91.70% Sp, and MCC of 0.85 [8]. Chang et al. reported an ACC of 93.30% and Sp of 95% [9]. Likewise, Meta-iAVP achieved an ACC of 94.90%, Sp of 98.30%, and MCC of 0.90 [16]. Furthermore, AntiVPP 1.0 using independent samples achieved an ACC of 93%, Sp of 97%, and MCC of 0.87 [12]. While our proposed Deepstacked-AVPs model surpassed the existing model, by

Discussion
AVPs are one major class of antimicrobial peptides used for developing novel peptide-based therapeutics to treat various viral infections. Existing traditional laboratory-based methods are laborious and inefficient due to their limited reliability. In this study, we introduce a stacked ensemble model, namely Deepstacked-AVPs, for the accurate discrimination of AVPs and non-AVPs. Initially, four different baseline models were trained using PSSM-TS-based improved evolutionary features, CTDT-based physicochemical properties, and word2vec-based features. The predicted probability scores of the baseline models are provided to logistic regression to form a stacked ensemble model.
To further investigate the extracted features, the hybrid vector is examined using the stacking model, resulting in an ACC of 92.20%, Sp of 93.91%, and AUC of 0.96. The hybrid vector has shown substantial improvement in terms of all evaluation parameters by compensating for the weaknesses of the individual vectors, as shown in Fig. 6. However, to develop a fast training model with minimal computational cost, IG is applied to choose 89 optimal features. The optimal feature set has shown further improvement by achieving an ACC of 96.60%, Sp of 97.35%, and AUC of 0.98. Our proposed model, using training samples, reported 4% higher ACC, 6% higher Sp, and 8% higher MCC than existing state-of-the-art predictors, as provided in Table 5. The generalization of the Deepstacked-AVPs model and its resistance to overfitting were validated using independent samples, on which it reported improved ACC, Sn, Sp, and MCC of ~2%, 9%, 2%, and 3%, respectively. Hence, representing the peptide samples using a multi-informative vector, specifically formulating segmented local features using the novel PSSM-TS approach, and leveraging the powerful training abilities of the stacked-ensemble model significantly improved the predictive performance. However, the stacking model still suffers from several issues, such as the risk of model dependency, computational cost, and hyperparameter tuning. In the future, we will focus on handling these issues.

Conclusion
In this paper, we developed the Deepstacked-AVPs model to predict antiviral peptides effectively. Considering the limitations of the existing feature formulation techniques, we numerically represented the amino acid samples using word2vec-based word embedding, PSSM-TS-based improved evolutionary features, and CTDT-based physicochemical properties. A multi-informative vector is formed by fusing the word2vec, PSSM-TS, and CTDT vectors. An information gain scheme is applied to develop a computationally effective model by choosing the optimal feature space from the hybrid vector. Subsequently, the Deepstacked-AVPs meta-model was trained using the probability scores of the individual classifiers. The Deepstacked-AVPs model exhibits consistency and stability by achieving a superior accuracy of 96.60% on the training samples and 95.15% on the independent dataset. Our model has outperformed state-of-the-art methods and will significantly contribute to antiviral peptide-related drug design and the pharmaceutical industry.

Fig. 2 The architecture of the skip-gram model

Fig. 3 A ROC analysis of Training samples, B PR analysis of Training samples, C ROC analysis of independent samples, D PR analysis of independent samples

Fig. 5 UMAP visualization of training samples using A CTDT, B Word2Vec, C PSSM-TS, D Hybrid features, E IG-based optimal features and F independent samples

Table 1 Hyperparameters of the classifier learning models

Table 2 Predictive outcomes of the training dataset via different feature descriptors

Table 3 Predictive outcomes of the hybrid vector of training samples after applying information gain

Table 4 Predictive outcomes of the Deepstacked-AVPs model using the independent dataset
Fig. 4 Comparison of individual, hybrid, and optimal vectors using A Training samples, B